Author = "Aaron Stephenson"
ASUid = "1222366145"
Unemployment rates are at the center of many discussions. They are used in the political sphere as a measure of how well a country is doing, and they matter both economically and on a deeply personal level: high unemployment often means increased suffering for the people who live in an area. This data set breaks the United States of America down by state and covers the years 1976 through 2022. Each state can be viewed individually by year, or each year can be viewed across all states.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import sklearn
import seaborn as sns
#confusion_matrix and accuracy_score may come in handy
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
#KNN Classifier and Regression models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
#functions to split and scale our data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
#Functions to test my findings
from sklearn.metrics import accuracy_score, precision_score, recall_score
#Allows us to exhaustively search for best hyperparameter using cross-validation
from sklearn.model_selection import GridSearchCV
The data set is read from a local file using the pandas read_csv function.
The data is cleaned in a few ways. First, the column names were long and unwieldy, so they were renamed to shorter, more convenient titles. Second, columns that were not relevant to the purpose of this report were removed. Finally, commas were stripped from the numbers to make them easier to work with throughout the analysis.
df = pd.read_csv('Unemployment_in_America.csv')
df = df.rename(columns = {"Total Civilian Non-Institutional Population in State/Area": "TotCivPop",
"Total Civilian Labor Force in State/Area" : "TotCivLF",
"Percent (%) of State/Area's Population" : "%State/Area Pop",
"Total Employment in State/Area" : "TotEmployedPerState",
"Percent (%) of Labor Force Employed in State/Area" : "%LF_Employed",
"Total Unemployment in State/Area" : "TotalUnemployedPerState",
"Percent (%) of Labor Force Unemployed in State/Area" : "%LF_Unemployed"})
#remove columns FIP code and Month
df = df.drop(df.columns[[0, 3]], axis =1)
#strip thousands separators; note the affected columns remain strings and may need pd.to_numeric later
df = df.replace(',', '', regex=True)
#display updated df
df
| | State/Area | Year | TotCivPop | TotCivLF | %State/Area Pop | TotEmployedPerState | %LF_Employed | TotalUnemployedPerState | %LF_Unemployed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 1976 | 2605000 | 1484555 | 57.0 | 1386023 | 53.2 | 98532 | 6.6 |
| 1 | Alaska | 1976 | 232000 | 160183 | 69.0 | 148820 | 64.1 | 11363 | 7.1 |
| 2 | Arizona | 1976 | 1621000 | 964120 | 59.5 | 865871 | 53.4 | 98249 | 10.2 |
| 3 | Arkansas | 1976 | 1536000 | 889044 | 57.9 | 824395 | 53.7 | 64649 | 7.3 |
| 4 | California | 1976 | 15621000 | 9774280 | 62.6 | 8875685 | 56.8 | 898595 | 9.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29887 | Virginia | 2022 | 6862585 | 4470272 | 65.1 | 4330531 | 63.1 | 139741 | 3.1 |
| 29888 | Washington | 2022 | 6254253 | 4015286 | 64.2 | 3832769 | 61.3 | 182517 | 4.5 |
| 29889 | West Virginia | 2022 | 1434789 | 784323 | 54.7 | 752464 | 52.4 | 31859 | 4.1 |
| 29890 | Wisconsin | 2022 | 4753700 | 3068610 | 64.6 | 2976670 | 62.6 | 91940 | 3.0 |
| 29891 | Wyoming | 2022 | 460134 | 293595 | 63.8 | 282247 | 61.3 | 11348 | 3.9 |
29892 rows × 9 columns
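After the commas are stripped, the total columns still hold strings rather than numbers. A minimal sketch of an explicit conversion with `pd.to_numeric`, using a tiny hand-made frame standing in for the cleaned `df` (column names match the report; the values are illustrative):

```python
import pandas as pd

# Stand-in for the cleaned df: totals arrive as strings after comma removal
df = pd.DataFrame({"TotCivPop": ["2605000", "232000"],
                   "%LF_Unemployed": [6.6, 7.1]})

# Convert the string totals into a proper numeric dtype
df["TotCivPop"] = pd.to_numeric(df["TotCivPop"])
print(df.dtypes)
```

Making the conversion explicit avoids relying on downstream libraries to coerce string columns on their own.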
The data set was grouped by state within each year in order to get a basic picture of each year and begin seeing patterns. Each state appears on the plot in a different color, with "Year" and "% of Labor Force Unemployed" as the arguments for the plot. This gives a basic layout over the years for both the country and each state. A first look shows that the country as a whole moves together with regard to unemployment, though individual states' percentages differ. It also shows that North Dakota, South Dakota, and Nebraska stay near the lowest point, nearly unaffected by the country's rises and falls. All states, however, appear to have been affected in 2020, which is to be expected.
#group data by Year and State/Area and take the mean of the numeric columns for each group
grouped_df = df.groupby(["Year", "State/Area"]).mean(numeric_only=True)
fig = px.line(grouped_df.reset_index(), x='Year', y='%LF_Unemployed', color='State/Area',
title='Average % of Labor Force Unemployed by Year and State/Area')
fig.show()
#display updated dataframe
grouped_df
| Year | State/Area | %State/Area Pop | %LF_Employed | %LF_Unemployed |
|---|---|---|---|---|
| 1976 | Alabama | 57.000000 | 53.166667 | 6.700000 |
| | Alaska | 68.466667 | 63.291667 | 7.566667 |
| | Arizona | 59.475000 | 53.666667 | 9.758333 |
| | Arkansas | 57.791667 | 53.808333 | 6.916667 |
| | California | 62.550000 | 56.833333 | 9.150000 |
| ... | ... | ... | ... | ... |
| 2022 | Virginia | 64.816667 | 63.000000 | 2.833333 |
| | Washington | 64.083333 | 61.408333 | 4.175000 |
| | West Virginia | 54.675000 | 52.525000 | 3.916667 |
| | Wisconsin | 65.033333 | 63.100000 | 2.933333 |
| | Wyoming | 63.708333 | 61.433333 | 3.566667 |
2491 rows × 3 columns
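The observation that some states barely move while others swing widely can be checked by ranking states by the spread of their unemployment series. A minimal sketch, using a tiny hand-made frame in place of the real `grouped_df` (the column names match the report; the values are illustrative):

```python
import pandas as pd

# Stand-in for the report's per-year state means; values are illustrative only
data = pd.DataFrame({
    "State/Area": ["Nebraska", "Nebraska", "Nevada", "Nevada"],
    "Year": [2019, 2020, 2019, 2020],
    "%LF_Unemployed": [3.0, 4.2, 3.9, 13.5],
})

# Standard deviation of each state's series: low spread = stable state
spread = data.groupby("State/Area")["%LF_Unemployed"].std().sort_values()
print(spread)
```

Run on the full grouped data, sorting this way surfaces the stable states (North Dakota, South Dakota, Nebraska) at the top and the volatile ones at the bottom.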
In this section each individual state is plotted using plotly.express. There are two separate plots, one showing employment and the other unemployment, so each state can be examined on its own. At the state level we can see the entire range of employment/unemployment across the covered years. Some states swing widely over this time frame while others seem almost untouched: many states move only a few percentage points in over 20 years, while at the other end of the spectrum states such as Nevada and Hawaii swing well over 20 points. Looking at the line plot above, it is easy to see when these major swings happen, but it is not obvious why some states are barely affected. The pandemic in 2020 appears to be one of the most influential periods, though not drastically more so than other upward spikes. Overall, there are clear patterns in the data, along with much that could be explored further.
fig = px.box(df, x='State/Area', y='%LF_Employed',
title='Distribution (%) of Labor Force Employed in State/Area',
width=1200, height=600, color = 'State/Area')
fig.show()
fig = px.box(df, x='State/Area', y='%LF_Unemployed',
title='Distribution (%) of Labor Force Unemployed in State/Area',
width=1200, height=600, color = 'State/Area')
fig.show()
Up to this point the focus has been on percentages in each state, but the total numbers are important as well, so the state totals will be used as features in our training and testing sets. This code block splits the data frame 70/30 into training and testing sets: 'X' holds the feature variables and 'y' the target variable, and 'random_state' is set to 42 for reproducibility. The block returns four sets of data ('X' train and test, 'y' train and test), whose shapes are printed to confirm the split sizes line up.
X = df[['TotCivLF', 'TotEmployedPerState', 'TotalUnemployedPerState']] # Features
y = df['State/Area'] # Target variable
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("X_train size:", X_train.shape)
print("X_test size:", X_test.shape)
print("y_train size:", y_train.shape)
print("y_test size:", y_test.shape)
X_train size: (20924, 3) X_test size: (8968, 3) y_train size: (20924,) y_test size: (8968,)
The following code block scales the feature sets. 'StandardScaler' is used to scale the data: 'fit_transform' fits the scaler to the training data and transforms it, then 'transform' applies the same scaling factors to the testing data. Fitting only on the training data ensures no information from the test set leaks into the model through the feature scales.
It is important to note that only the features are scaled, not the target variable 'y'.
#Create a StandardScaler object
scaler = StandardScaler()
#Fit the scaler to the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)
#Use the fitted scaler to transform the testing data
X_test_scaled = scaler.transform(X_test)
First, the 'param_grid' variable specifies the range of hyperparameters to search over; here it is the number of neighbors in the KNN classifier. Second, 'GridSearchCV' performs a cross-validated grid search over that hyperparameter space with 5-fold cross-validation. The 'grid_search' object is then fit to the training data, and the number of neighbors with the best cross-validation performance is stored in 'k_best'. That value is used to train a new KNN classifier, stored in 'knn_best'. Finally, accuracy, precision, and recall scores evaluate the performance of the best KNN classifier on the test set, and the results are printed. (Note that the search is run on the unscaled features; the scaled versions prepared above could be substituted.) From these results, 'k_best' turns out to be three, and we get the scores from the functions stated above.
#Define the hyperparameters to search over
param_grid = {'n_neighbors': range(1, 31)}
#Create a KNN classifier object
knn = KNeighborsClassifier()
#Create a grid search object to find the optimal k
grid_search = GridSearchCV(knn, param_grid=param_grid, cv=5)
#Fit the grid search object to the training data
grid_search.fit(X_train, y_train)
#Get the best hyperparameters from the grid search
k_best = grid_search.best_params_['n_neighbors']
#Train the KNN classifier using the best k value
knn_best = KNeighborsClassifier(n_neighbors=k_best)
knn_best.fit(X_train, y_train)
#Make predictions on the test set using the best KNN classifier
y_pred = knn_best.predict(X_test)
#Evaluate the performance of the best KNN classifier
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
print(f"Accuracy of KNN with k={k_best}: {accuracy:.3f}")
print(f"Precision of KNN with k={k_best}: {precision:.3f}")
print(f"recall of KNN with k={k_best}: {recall:.3f}")
Accuracy of KNN with k=3: 0.712 Precision of KNN with k=3: 0.714 recall of KNN with k=3: 0.712
Next, create a KNN model with the optimal value of k (k_best) and perform 5-fold, 10-fold, and 15-fold cross-validation on the training set using the cross_val_score function. The mean cross-validation score for each fold count is printed to estimate the generalization performance of the model.
from sklearn.model_selection import cross_val_score
#Create KNN model with optimal k
knn = KNeighborsClassifier(n_neighbors=k_best)
#Perform cross-validation
scores5 = cross_val_score(knn, X_train, y_train, cv=5)
scores10 = cross_val_score(knn, X_train, y_train, cv=10)
scores15 = cross_val_score(knn, X_train, y_train, cv=15)
#Print the mean cross-validation score
print("Mean cross-validation score 5 folds: {:.3f}".format(scores5.mean()))
print("Mean cross-validation score 10 folds: {:.3f}".format(scores10.mean()))
print("Mean cross-validation score 15 folds: {:.3f}".format(scores15.mean()))
Mean cross-validation score 5 folds: 0.703 Mean cross-validation score 10 folds: 0.714 Mean cross-validation score 15 folds: 0.715
Now a Naive Bayes classifier will be used for comparison against the KNN classifier. Accuracy measures overall performance, while precision, recall, and F1-score give a more complete picture: precision is the proportion of true positives among all positive predictions, recall is the proportion of actual positives that were correctly identified, and the F1-score combines the two. A confusion matrix then shows the true positives, false positives, true negatives, and false negatives for our classifier, and the classification report is printed to view the overall scores. Note that this run uses a randomly generated 'make_classification' dataset rather than the unemployment features, so while the Naive Bayes classifier scores higher here, the comparison with the KNN results above is not apples-to-apples.
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
#Generate a random classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
#Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
#Create a Naive Bayes classifier
clf = GaussianNB()
#train the classifier on the training data
clf.fit(X_train, y_train)
#Make predictions on the testing data
y_pred = clf.predict(X_test)
#Evaluate the performance of the classifier
accuracy = clf.score(X_test, y_test)
print('Accuracy:', accuracy)
#Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', cm)
#Create a classification report
cr = classification_report(y_test, y_pred)
print('Classification Report:\n', cr)
Accuracy: 0.87
Confusion Matrix:
[[150 8]
[ 31 111]]
Classification Report:
precision recall f1-score support
0 0.83 0.95 0.88 158
1 0.93 0.78 0.85 142
accuracy 0.87 300
macro avg 0.88 0.87 0.87 300
weighted avg 0.88 0.87 0.87 300
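Since the Naive Bayes run above is scored on a synthetic dataset, a fairer comparison would fit both classifiers on one shared split. A hedged sketch of that idea, reusing the same synthetic data as a stand-in (the real unemployment `X`/`y` would slot in the same way):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# One shared split so both classifiers see identical data
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit and score each model on the same train/test sets
nb_acc = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)
knn_acc = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr).score(X_te, y_te)
print(f"Naive Bayes: {nb_acc:.3f}  KNN (k=3): {knn_acc:.3f}")
```

Only a head-to-head run like this supports a claim that one classifier is more accurate than the other.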
This code tests the performance of a KNN regression model on a randomly generated dataset with 100 samples and 10 features, aiming to find the value of k that gives the lowest RMSE. The code defines a range of k values to test and a set of fold counts for cross-validation, then stores the RMSE scores for each combination. For each fold count, the code loops over the k values, creates a KNN regressor, computes the RMSE score via cross-validation, and appends the mean RMSE to the list of scores. A line plot is then drawn for each fold count, with a polynomial regression line added for the 5-fold curve. The plot helps identify the value of k that gives the lowest RMSE for the KNN model.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
#Generate a random regression dataset
X, y = make_regression(n_samples=100, n_features=10, noise=0.2, random_state=42)
#Define the range of values of k to test
k_values = range(1, 30)
#Initialize a list to store the RMSE scores for each value of k and each number of folds
scores = []
cv_values = [5, 10, 15]
# Loop over each number of folds
for cv in cv_values:
cv_scores = []
# Loop over each value of k
for k in k_values:
#Create a KNN regressor with k neighbors
knn = KNeighborsRegressor(n_neighbors=k)
#Compute the RMSE score using cross-validation with cv folds
rmse_scores = -cross_val_score(knn, X, y, cv=cv, scoring='neg_root_mean_squared_error')
#Append the mean RMSE score to the list
cv_scores.append(rmse_scores.mean())
#Append the list of RMSE scores to the list of scores for all folds
scores.append(cv_scores)
print(np.mean(X), np.mean(y))
#Create a line plot of the RMSE scores for each number of folds
plt.figure(figsize=(20, 12))
plt.plot(k_values, scores[0], label='CV=5')
plt.plot(k_values, scores[1], label='CV=10')
plt.plot(k_values, scores[2], label='CV=15')
#Add a polynomial regression line to the plot
x = np.array(k_values)
y_fit = np.poly1d(np.polyfit(x, scores[0], 3))(x)
plt.plot(x, y_fit, label='Regression line', linestyle='--')
plt.xlabel('Number of neighbors (k)')
plt.ylabel('RMSE')
plt.title('KNN RMSE scores for different number of folds')
plt.legend()
plt.show()
0.019332055822325507 11.026524280596613
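Rather than reading the best k off the plot, it can be extracted programmatically as the argmin of an RMSE curve. A minimal sketch with an illustrative score list standing in for the report's `scores[0]` (the real values come from the cross-validation loop above):

```python
import numpy as np

k_values = range(1, 30)
# Illustrative RMSE curve standing in for scores[0]; dips at the third entry
cv_scores = [90.0, 75.0, 70.0, 72.0] + [80.0] * 25

# Index of the lowest RMSE maps back to the corresponding k
best_idx = int(np.argmin(cv_scores))
best_k = list(k_values)[best_idx]
print(f"Lowest RMSE {cv_scores[best_idx]:.1f} at k={best_k}")  # → Lowest RMSE 70.0 at k=3
```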
This code starts by defining a range of training set sizes to use in cross-validation. The data is shuffled and split into training and test sets. For each training set size, the code randomly selects a subset of the training data and trains the classifier on it. The accuracy and standard deviation are then calculated for the training and testing sets. Finally, the code plots the training and testing accuracies, with a shaded band for the standard deviation of the testing accuracy. This shows how accuracy changes as more data is used for training.
from sklearn.model_selection import ShuffleSplit
#Define the training set sizes to use in cross-validation
train_sizes = np.linspace(0.1, 1.0, 10)
#Create a ShuffleSplit object for cross-validation
cv = ShuffleSplit(n_splits=100, test_size=0.3, random_state=42)
#Create empty lists to store the accuracies and standard deviations
train_scores = []
test_scores = []
train_stds = []
test_stds = []
#Loop through each training set size
for train_size in train_sizes:
# Calculate the number of training samples to use
n_train = int(len(X_train) * train_size)
#Randomly select n_train samples from X_train and y_train
rng = np.random.default_rng(42)
indices = rng.choice(len(X_train), size=n_train, replace=False)
X_train_small = X_train[indices]
y_train_small = y_train[indices]
#Train the classifier on the new training set
clf.fit(X_train_small, y_train_small)
#Calculate the training and testing accuracies and standard deviations
train_score, train_std = clf.score(X_train_small, y_train_small), 0.0
test_scores_fold = cross_val_score(clf, X_test, y_test, cv=cv)
test_score, test_std = np.mean(test_scores_fold), np.std(test_scores_fold)
#Append the accuracies and standard deviations to the lists
train_scores.append(train_score)
test_scores.append(test_score)
train_stds.append(train_std)
test_stds.append(test_std)
print("Train Score: ", np.mean(train_scores),"Test Score: ", np.mean(test_scores))
# Plot the results
plt.figure(figsize=(20, 12))
plt.plot(train_sizes * len(X_train), train_scores, label='Training accuracy')
plt.plot(train_sizes * len(X_train), test_scores, label='Testing accuracy')
plt.fill_between(train_sizes * len(X_train), np.array(test_scores) - np.array(test_stds), np.array(test_scores) + np.array(test_stds), alpha=0.2)
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.title('Naive Bayes classifier accuracy')
plt.legend()
plt.show()
Train Score: 0.8366955782312925 Test Score: 0.865777777777778
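The hand-rolled subsampling loop above can also be expressed with scikit-learn's `learning_curve` helper, which handles the resampling and cross-validation internally. A sketch on a synthetic stand-in dataset (the classifier and size fractions mirror the report; the data is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the classification data used above
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, random_state=42)

# 10 training-set sizes from 10% to 100%, scored with 5-fold CV;
# returns one row of scores per size, one column per fold
sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5)

print("mean test accuracy per size:", test_scores.mean(axis=1).round(3))
```

This yields the same kind of learning curve with less bookkeeping, and the per-fold score matrices make the shaded standard-deviation band a one-liner.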